Phoneme Lattice Based TextTiling towards Multilingual Story Segmentation
Abstract
This paper proposes a phoneme lattice based TextTiling approach towards multilingual story segmentation. The phoneme is the smallest segmental unit in a language and the number of phonemes in a language is usually far smaller than the number of words. Furthermore, many phonemes are shared by different languages. These properties make phonemes particularly appropriate for representing multilingual speech. As phoneme recognition is far from perfect, phoneme lattices, which carry much richer statistics than the 1-best hypotheses, are adopted in this paper as the input to the TextTiling approach. The term frequencies used in traditional TextTiling are replaced by the expected counts of phoneme n-gram units calculated from phoneme lattices. Experiments on TDT2 English and Mandarin corpora show that the phoneme lattice based TextTiling outperforms the phoneme 1-best based TextTiling and word based TextTiling in broadcast news story segmentation.
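The core idea above, replacing term frequencies with expected counts taken from a lattice and then running TextTiling on them, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the edge-tuple lattice format, the function names, and the restriction to unigram counts (the paper uses phoneme n-gram units) are all assumptions made for brevity. Edge probabilities are assumed to be already normalized so that path probabilities can be multiplied directly, and the end node is assumed reachable from the start node.

```python
import math
from collections import defaultdict


def expected_counts(edges, start, end):
    """Expected phoneme (unigram) counts from a lattice via forward-backward.

    edges: (src, dst, phoneme, prob) tuples. Node ids are assumed to be
    integers in topological order (src < dst). Hypothetical format.
    """
    # Forward pass: total probability of reaching each node from `start`.
    alpha = defaultdict(float)
    alpha[start] = 1.0
    for src, dst, _, p in sorted(edges, key=lambda e: e[0]):
        alpha[dst] += alpha[src] * p
    # Backward pass: total probability of reaching `end` from each node.
    beta = defaultdict(float)
    beta[end] = 1.0
    for src, dst, _, p in sorted(edges, key=lambda e: -e[1]):
        beta[src] += p * beta[dst]
    total = alpha[end]
    # The posterior of an edge is its share of the total path probability;
    # summing posteriors per label gives the expected count.
    counts = defaultdict(float)
    for src, dst, phoneme, p in edges:
        counts[phoneme] += alpha[src] * p * beta[dst] / total
    return dict(counts)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def depth_scores(sims):
    """TextTiling-style depth score at each gap between adjacent blocks.

    For each similarity value, climb to the nearest peak on either side
    and sum the two drops; deep valleys suggest story boundaries.
    """
    depths = []
    for i, s in enumerate(sims):
        left = s
        for j in range(i - 1, -1, -1):
            if sims[j] < left:
                break
            left = sims[j]
        right = s
        for j in range(i + 1, len(sims)):
            if sims[j] < right:
                break
            right = sims[j]
        depths.append((left - s) + (right - s))
    return depths
```

In use, one would compute `expected_counts` for the lattice of each speech block, take `cosine` between the count vectors of adjacent blocks, and hypothesize boundaries where `depth_scores` exceeds a threshold chosen on held-out data. Extending the counts from unigrams to n-grams requires tracking label histories along lattice paths, which this sketch omits.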
Similar Resources
Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation
This paper proposes probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits a deeper underlying relation among terms beyond their occurrences, so conceptual matching can replace literal term matching. Unlike text segmentation, lexical BN story segmentation has to be carried out over LVCSR transcripts, where...
Multi-Scale TextTiling for Automatic Story Segmentation in Chinese Broadcast News
This paper applies Chinese subword representations, namely character and syllable n-grams, into the TextTiling-based automatic story segmentation of Chinese broadcast news. We show the robustness of Chinese subwords against speech recognition errors, out-of-vocabulary (OOV) words and versatility in word segmentation in lexical matching on errorful Chinese speech recognition transcripts. We prop...
Spoken and Written News Story Segmentation Using Lexical Chains
In this paper we describe a novel approach to lexical chain based segmentation of broadcast news stories. Our segmentation system SeLeCT is evaluated with respect to two other lexical cohesion based segmenters TextTiling and C99. Using the Pk and WindowDiff evaluation metrics we show that SeLeCT outperforms both systems on spoken news transcripts (CNN) while the C99 algorithm performs best on t...
Shot Boundary Determination on MPEG Compressed Domain and Story Segmentation Experiments for TRECVID 2003
KDDI R&D Laboratories has participated in past TREC conferences for text retrieval tasks. This year we are participating for the first time in TRECVID 2003, namely in the shot boundary determination and story segmentation tasks. In the shot boundary determination task, we applied our proprietary shot segmentation algorithm, originally proposed in [1] and slightly upgraded for this task. In our methods,...
SeLeCT: A Lexical Cohesion Based News Story Segmentation System
In this paper we compare the performance of three distinct approaches to lexical cohesion based text segmentation. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e. distinct news stories from broadcast news programmes. Our approach to news s...